Info file gptx.info, produced by Makeinfo, -*- Text -*- from input
file /usr2/pinard/gptx/0.2/doc/gptx.texi.
Copyright (C) 1990 Free Software Foundation, Inc. Francois Pinard
<pinard@iro.umontreal.ca>, 1988.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 1, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
File: gptx.info, Node: Top, Next: Usage, Up: (DIR)
`gptx' - GNU permuted index generator
=====================================
This is the 0.2 alpha release of `gptx', the GNU version of a
permuted index generator. The main goal of this software is to
provide an *almost* compatible replacement for `ptx', able to handle
small files quickly, while serving as a platform for further
development.
This version reimplements and extends standard `ptx'. Among other
things, it can produce a readable "KWIC" (keywords in their context)
without the need for `nroff'; there is also an option to produce
TeX-compatible output. This version does not yet handle huge input files,
that is, those files which do not fit in memory all at once.
*Please note* that an overall renaming of all options is
foreseeable. In fact, GNU ptx specifications are not frozen yet.
* Menu:
* Usage:: How to use the program, its options and parameters.
* Regexps:: How a regular expression is written and used.
* ptx mode:: In which ways `ptx' mode is different.
* Future:: What are the development lines of this program.
File: gptx.info, Node: Usage, Next: Regexps, Prev: Top, Up: Top
How to use this program
-----------------------
This tool reads a text file and essentially produces a permuted
index, with each keyword in its context. The calling sketch is one of:
gptx [OPTION]... [INPUT]... >OUTPUT
or:
ptx [OPTION]... [INPUT [OUTPUT]]
These are two different versions of one program. When using `ptx'
instead of `gptx', this implies built-in `ptx' compatibility mode,
disallowing extensions, introducing some limitations, and changing
several of the program's default option values. This documentation
describes both modes of operation. See *Note ptx mode:: for an
explicit list of differences.
As usual, each option is represented by a hyphen followed by a
single letter. Some options require a parameter in the form of a
decimal number or a file name, in which case the parameter follows the
option after some whitespace. Option letters may be grouped and tied
together as a string which follows only one hyphen; if one or several
of them require parameters, the parameters should follow the combined
options, in the order of appearance of the individual letters in the
string. Individual options are explained below.
When *not* in `ptx' compatibility mode, there may be zero, one or
several parameters after the options. If there are no parameters, the
program reads the standard input. If there are one or several
parameters, they name the input files, which are all read in
turn, as if all the input files were concatenated. However, there is a
full contextual break between each file; and when automatic referencing
is requested, file names and line numbers refer to individual text
input files. In all cases, the program produces the permuted index
onto the standard output.
When in `ptx' compatibility mode, besides the options, there may be
zero, one or two parameters. If there are no parameters, the program
reads the standard input and produces the permuted index onto the
standard output. If there is only one parameter, it names the text
file to be read instead of the standard input. If two parameters are
given, they give respectively the name of the file to read and the
name of the file to produce. *Be careful* to note that, in this
case, the contents of the file named by the second parameter are
destroyed. This behaviour is dictated by compatibility; GNU
standards discourage output parameters not introduced by an option.
Note that for *any* file named as the value of an option or as an
input text file, a single dash `-' may be used, in which case standard
input is assumed. However, it would not make sense to use this
convention more than once per program invocation.
* Menu:
* General options:: Options which affect general program behaviour.
* Charset selection:: Underlying character set considerations.
* Input processing:: Input fields, contexts, and keyword selection.
* Output formatting:: Types of output format, and sizing the fields.
File: gptx.info, Node: General options, Next: Charset selection, Up: Usage
General options
...............
`-C'
Prints a short note about the Copyright and copying conditions.
File: gptx.info, Node: Charset selection, Next: Input processing, Prev: General options, Up: Usage
Charset selection
.................
As it is set up now, the program assumes that the input file is coded
using the 8-bit ISO 8859-1 code, also known as the Latin-1 character
set, *unless* it is compiled for MS-DOS, in which case it uses the
character set of the IBM PC. Compared to 7-bit ASCII, the set of
characters which are letters is then different; this alters the
behaviour of regular expression matching. Thus, the default regular
expression for a keyword allows foreign or diacriticized letters.
Keyword sorting, however, is still crude; it obeys the underlying
character set ordering quite blindly.
`-f'
Fold lower case letters to upper case for sorting.
File: gptx.info, Node: Input processing, Next: Output formatting, Prev: Charset selection, Up: Usage
Word selection
..............
`-b FILE'
This option is an alternative to option `-W' for describing
which characters make up words. It introduces the name
of a file which contains a list of characters which can*not* be
part of a word; this file is called the "Break file". Any
character which is not part of the Break file is a word
constituent. If both options `-b' and `-W' are specified, then
`-W' has precedence and `-b' is ignored.
In normal mode, the only way to avoid newline as a break
character is to write all the break characters in the file with
no newline at all, not even at the end of the file. In `ptx'
compatibility mode, spaces, tabs and newlines are always
considered as break characters even if not included in the Break
file.
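The Break-file rule above amounts to a simple complement: any
character absent from the break set is a word constituent. Here is a
small sketch of that rule (the helper name is invented for
illustration; this is not gptx's actual implementation):

```python
# Sketch of the Break-file rule: any character absent from the break
# set is a word constituent.  Helper name invented for illustration.

def words_from_break_set(text, break_chars):
    """Split `text` into words, where `break_chars` holds every
    character that may NOT be part of a word."""
    words, current = [], []
    for ch in text:
        if ch in break_chars:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

# With space, tab and newline as the only break characters:
print(words_from_break_set("one two\tthree\nfour", " \t\n"))
```

Note that with this rule, a period is a word constituent unless it is
listed in the Break file.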
`-i FILE'
The file associated with this option contains a list of words
which will never be taken as keywords in concordance output. It
is called the "Ignore file". The file contains exactly one word
in each line; the end of line separation of words is not subject
to the value of the `-S' option.
If not specified, there might be a default Ignore file. Default
Ignore files are not necessarily the same in normal mode or in
`ptx' compatibility mode. Unless changed by the local
installation, there is *no* default Ignore file in normal mode,
and the Ignore file is `/usr/lib/eign' in `ptx' compatibility
mode. If you want to deactivate a default Ignore file, use
`/dev/null' instead.
`-o FILE'
The file associated with this option contains a list of words
which will be retained in concordance output; any word not
mentioned in this file is ignored. The file is called the "Only
file". The file contains exactly one word in each line; the end
of line separation of words is not subject to the value of the
`-S' option.
There is no default for the Only file. If there are
both an Only file and an Ignore file, a word is eligible to
be a keyword only if it is given in the Only file and not given
in the Ignore file.
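The interplay of the Only and Ignore files amounts to simple set
logic, sketched below (the helper name is invented for illustration,
not taken from gptx):

```python
# Sketch of keyword eligibility: a word may become a keyword only if
# it is in the Only file (when one is given) and not in the Ignore
# file.  Helper name invented for illustration.

def is_keyword(word, only_words, ignore_words):
    """`only_words` is None when no Only file is given."""
    if only_words is not None and word not in only_words:
        return False
    return word not in ignore_words

only = {"index", "keyword"}
ignore = {"keyword"}
print([w for w in ("index", "keyword", "the") if is_keyword(w, only, ignore)])
```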
`-r'
On each input line, the leading sequence of non-white-space
characters will be taken to be a reference that has the purpose of
identifying this input line in the produced permuted index. See
*Note Output formatting:: for more information about reference
production. Using this option changes the default value for
option `-S'.
Using this option, the program does not try very hard to remove
references from contexts in output, but it succeeds in doing so
*when* the context ends exactly at the newline. If option `-r'
is used with `-S' default value, or when in `ptx' compatibility
mode, this condition is always met and references are completely
excluded from the output contexts.
`-S REGEXP'
This option selects which regular expression will describe the
end of a line or the end of a sentence. In fact, there is no other
distinction between ends of lines and ends of sentences than the
effect of this regular expression, and input line boundaries have
no special significance outside this option. By default, in
`ptx' compatibility mode or if option `-r' is used, ends of lines
are used; in this case, the REGEXP used is very simple:
\n
In normal mode and if option `-r' is not used, by default, ends of
sentences are used; the precise REGEXP is imported from GNU Emacs:
[.?!][]\"')}]*\\($\\|\t\\| \\)[ \t\n]*
An empty REGEXP is equivalent to completely disabling end of line
or end of sentence recognition. In this case, the whole file is
considered to be a single big line or sentence. The user might
want to disallow all truncation flag generation as well, through
option `-F ""'. On regular expression writing and usage, see
*Note Regexps::.
When the keywords happen to be near the beginning of the input
line or sentence, this often creates an unused area at the
beginning of the output context line; when the keywords happen to
be near the end of the input line or sentence, this often creates
an unused area at the end of the output context line. The
program tries to fill those unused areas by wrapping around
context in them; the tail of the input line or sentence is used
to fill the unused area on the left of the output line; the head
of the input line or sentence is used to fill the unused area on
the right of the output line.
This option is not available when the program is operating in
`ptx' compatibility mode.
`-W REGEXP'
This option selects which regular expression will describe each
keyword. By default, in `ptx' compatibility mode, a word is
anything which ends with a space, a tab or a newline; the REGEXP
used is `[^ \t\n]+'.
In normal mode, a word is a sequence of letters; the REGEXP used
is `\w+'.
An empty REGEXP is equivalent to not using this option, letting
the default dive in. On regular expression writing and usage, see
*Note Regexps::.
This option is not available when the program is operating in
`ptx' compatibility mode.
File: gptx.info, Node: Output formatting, Prev: Input processing, Up: Usage
Output formatting
.................
Output format is mainly controlled by `-O' and `-T' options,
described in the table below. However, when neither `-O' nor `-T' is
selected, and if we are not running in `ptx' compatibility mode, the
program chooses an output format suited for a dumb terminal. This is
the default format when working in normal mode. Each keyword
occurrence is output to the center of one line, surrounded by its left
and right contexts. Each field is properly justified, so the
concordance output can readily be observed. As a special feature,
if automatic references are selected by option `-A' and are output
before the left context, that is, if option `-R' is *not* selected,
then a colon is added after the reference; this nicely interfaces with
GNU Emacs `next-error' processing. In this default output format,
each white space character, like newline and tab, is merely changed to
exactly one space, with no special attempt to compress consecutive
spaces. This might change in the future. Except for those white
space characters, every other character of the underlying set of 256
characters is transmitted verbatim.
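The default layout can be roughly pictured with a small sketch: the
keyword and its following context start at the middle of the output
line, with the left context right-justified before it. The sizing
below is deliberately simplified and is *not* gptx's exact
justification algorithm:

```python
# Rough sketch of the default (dumb terminal) layout.  Field sizing
# here is simplified for illustration; gptx's real algorithm also
# handles gaps, references and wrap-around filling.

def kwic_line(before, keyword_and_after, width=72):
    half = width // 2
    left = before[-(half - 1):].rjust(half - 1)  # right-justified left context
    right = keyword_and_after[:half]             # keyword starts mid-line
    return left + " " + right

print(kwic_line("a permuted", "index generator", width=30))
```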
Output format is further controlled by the following options.
`-g NUMBER'
Select the size of the minimum white gap between the fields on
the output line.
`-w NUMBER'
Select the output maximum width of each final line. If
references are used, they are included or excluded from the
output maximum width depending on the value of option `-R'. If
this option is not selected, that is, when references are output
before the left context, the output maximum width takes into
account the maximum length of all references. If this option is
selected, that is, when references are output after the right
context, the output maximum width does not take into account the
space taken by references, nor the gap that precedes them.
`-A'
Select automatic references. Each input line will have an
automatic reference made up of the file name and the line
ordinal, with a single colon between them. However, the file
name will be empty when standard input is being read. If both
`-A' and `-r' are selected, then the input reference is still
read and skipped, but the automatic reference is used at output
time, overriding the input reference.
This option is not available when the program is operating in
`ptx' compatibility mode.
`-R'
In default output format, when option `-R' is not used, any
reference produced by the effect of options `-r' or `-A' is
given at the beginning of each output line, before the left
context. In default output format, when option `-R' is
specified, references are rather given to the far right of output
lines, after the right context. For any other output format, option
`-R' is almost ignored, except for the fact that the width of
references is *not* taken into account in total output width
given by `-w' whenever `-R' is selected.
This option is not explicitly selectable when the program is
operating in `ptx' compatibility mode. However, in this case, it
is always implicitly selected.
`-F STRING'
This option will request that any truncation in the output be
reported using the string STRING. Most output fields
theoretically extend towards the beginning or the end of the
current line, or current sentence, as selected with option `-S'.
But there is a maximum allowed output line width, changeable
through option `-w', which is further divided into space for
various output fields. When a field has to be cut short because
it cannot extend to the beginning or the end of the current line
and still fit in the allotted space, a truncation occurs. By
default, the string
used is a single slash, as in `-F /'.
STRING may have more than one character, as in `-F ...'. Also,
in the particular case STRING is empty (`-F ""'), truncation
flagging is disabled, and no truncation marks are appended in
this case.
This option is not available when the program is operating in
`ptx' compatibility mode.
`-O'
Choose an output format suitable for `nroff' or `troff'
processing. Each output line will look like:
.xx "TAIL" "BEFORE" "KEYWORD_AND_AFTER" "HEAD" "REF"
so it will be possible to write an `.xx' roff macro to take care
of the output typesetting. This is the default output format
when working in `ptx' compatibility mode.
In this output format, each non-graphical character, like newline
and tab, is merely changed to exactly one space, with no special
attempt to compress consecutive spaces. Each quote character
`"' is doubled, so it will be correctly processed by `nroff' or
`troff'. All characters having their eighth bit set are turned
into spaces in this version. Presumably, diacriticized
characters could be expressed correctly in `roff' terms, if I
learn how to do this; so, let me know how to improve this special
character processing.
This option is not selectable when the program is operating in
`ptx' compatibility mode; in fact, its output format then becomes
the default and sole output format.
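The quote-doubling convention above is easy to sketch. The field
contents here are invented for illustration; only the `.xx' line shape
and the doubling of `"' come from the description above:

```python
# Sketch of the -O (roff) output convention: five double-quoted
# fields after `.xx', with each `"' inside a field doubled so that
# nroff/troff parse it correctly.  Field contents are invented.

def roff_field(text):
    return '"' + text.replace('"', '""') + '"'

def xx_line(tail, before, keyword_and_after, head, ref):
    fields = (tail, before, keyword_and_after, head, ref)
    return ".xx " + " ".join(roff_field(f) for f in fields)

print(xx_line("", "he said", '"hello" there', "", "file:12"))
```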
`-T'
Choose an output format suitable for TeX processing. Each output
line will look like:
\xx {TAIL}{BEFORE}{KEYWORD}{AFTER}{HEAD}{REF}
so it will be possible to write a `\xx' definition to take
care of the output typesetting. Note that when references are
not being produced, that is, neither option `-A' nor option `-r'
is selected, the last parameter of each `\xx' call is inhibited.
In this output format, some special characters, like `$', `%',
`&', `#' and `_' are automatically protected with a backslash.
Curly brackets `{', `}' are also protected with a backslash, but
also enclosed in a pair of dollar signs to force mathematical
mode. The backslash itself produces the sequence `\backslash{}'.
Circumflex and tilde diacritics produce the sequence `^\{ }' and
`~\{ }' respectively. Other diacriticized characters of the
underlying character set produce an appropriate TeX sequence as
far as possible. The other non-graphical characters, like
newline and tab, and all other characters which are not part of
ASCII, are merely changed to exactly one space, with no special
attempt to compress consecutive spaces. Let me know how to
improve this special character processing for TeX.
This option is not available when the program is operating in
`ptx' compatibility mode.
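The escaping rules described for `-T' can be sketched as a small
translation table; only the rules named above are implemented here,
and the handling of diacritics is omitted:

```python
# Sketch of the -T escaping rules: $, %, &, # and _ get a backslash;
# { and } are backslashed and wrapped in dollar signs to force math
# mode; the backslash itself becomes \backslash{}.  Diacritics and
# non-ASCII handling are omitted from this sketch.

TEX_SIMPLE = set("$%&#_")

def tex_escape(text):
    out = []
    for ch in text:
        if ch in TEX_SIMPLE:
            out.append("\\" + ch)
        elif ch == "{":
            out.append("$\\{$")
        elif ch == "}":
            out.append("$\\}$")
        elif ch == "\\":
            out.append("\\backslash{}")
        else:
            out.append(ch)
    return "".join(out)

print(tex_escape("100% of {a_b}"))
```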
File: gptx.info, Node: Regexps, Next: ptx mode, Prev: Usage, Up: Top
Syntax of Regular Expressions
-----------------------------
Regular expressions have a syntax in which a few characters are
special constructs and the rest are "ordinary". An ordinary character
is a simple regular expression which matches that character and
nothing else. The special characters are `$', `^', `.', `*', `+',
`?', `[', `]' and `\'; no new special characters will be defined. Any
other character appearing in a regular expression is ordinary, unless
a `\' precedes it.
For example, `f' is not a special character, so it is ordinary, and
therefore `f' is a regular expression that matches the string `f' and
no other string. (It does not match the string `ff'.) Likewise, `o'
is a regular expression that matches only `o'.
Any two regular expressions A and B can be concatenated. The
result is a regular expression which matches a string if A matches
some amount of the beginning of that string and B matches the rest of
the string.
As a simple example, we can concatenate the regular expressions `f'
and `o' to get the regular expression `fo', which matches only the
string `fo'. Still trivial. To do something nontrivial, you need to
use one of the special characters. Here is a list of them.
`. (Period)'
is a special character that matches any single character except a
newline. Using concatenation, we can make regular expressions
like `a.b' which matches any three-character string which begins
with `a' and ends with `b'.
`*'
is not a construct by itself; it is a suffix, which means the
preceding regular expression is to be repeated as many times as
possible. In `fo*', the `*' applies to the `o', so `fo*' matches
one `f' followed by any number of `o's. The case of zero `o's is
allowed: `fo*' does match `f'.
`*' always applies to the smallest possible preceding expression.
Thus, `fo*' has a repeating `o', not a repeating `fo'.
The matcher processes a `*' construct by matching, immediately,
as many repetitions as can be found. Then it continues with the
rest of the pattern. If that fails, backtracking occurs,
discarding some of the matches of the `*'-modified construct in
case that makes it possible to match the rest of the pattern.
For example, matching `ca*ar' against the string `caaar', the
`a*' first tries to match all three `a's; but the rest of the
pattern is `ar' and there is only `r' left to match, so this try
fails. The next alternative is for `a*' to match only two `a's.
With this choice, the rest of the regexp matches successfully.
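The `ca*ar' backtracking example can be replayed with Python's `re'
engine, which backtracks the same way (note the plain Python regexp
syntax, without Emacs-style escapes):

```python
import re

# Backtracking on `ca*ar': the matcher first takes all the `a's it
# can, then gives some back when the rest of the pattern fails.

assert re.fullmatch(r"ca*ar", "caaar") is not None  # a* gives back one `a'
assert re.fullmatch(r"ca*ar", "car") is not None    # a* may match zero `a's
assert re.fullmatch(r"ca*ar", "cr") is None         # nothing left for `ar'
print("backtracking examples hold")
```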
`+'
Is a suffix character similar to `*' except that it requires that
the preceding expression be matched at least once. So, for
example, `ca+r' will match the strings `car' and `caaaar' but not
the string `cr', whereas `ca*r' would match all three strings.
`?'
Is a suffix character similar to `*' except that it can match the
preceding expression either once or not at all. For example,
`ca?r' will match `car' or `cr'; nothing else.
`[ ... ]'
`[' begins a "character set", which is terminated by a `]'. In
the simplest case, the characters between the two form the set.
Thus, `[ad]' matches either one `a' or one `d', and `[ad]*'
matches any string composed of just `a's and `d's (including the
empty string), from which it follows that `c[ad]*r' matches `cr',
`car', `cdr', `caddaar', etc.
Character ranges can also be included in a character set, by
writing two characters with a `-' between them. Thus, `[a-z]'
matches any lower-case letter. Ranges may be intermixed freely
with individual characters, as in `[a-z$%.]', which matches any
lower case letter or `$', `%' or period.
Note that the usual special characters are not special any more
inside a character set. A completely different set of special
characters exists inside character sets: `]', `-' and `^'.
To include a `]' in a character set, you must make it the first
character. For example, `[]a]' matches `]' or `a'. To include a
`-', write `--', which is a range containing only `-'. To
include `^', make it other than the first character in the set.
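The character-set examples above can be checked with Python's `re'
module, whose set syntax follows the same conventions:

```python
import re

# Character-set examples: `c[ad]*r' matches cr, car, cdr, caddaar;
# a `]' placed first in a set is literal; ranges mix with singles.

for s in ("cr", "car", "cdr", "caddaar"):
    assert re.fullmatch(r"c[ad]*r", s) is not None
assert re.fullmatch(r"[]a]", "]") is not None      # leading `]' is literal
assert re.fullmatch(r"[a-z$%.]", "%") is not None  # range plus single chars
print("character-set examples hold")
```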
`[^ ... ]'
`[^' begins a "complement character set", which matches any
character except the ones specified. Thus, `[^a-z0-9A-Z]'
matches all characters except letters and digits.
`^' is not special in a character set unless it is the first
character. The character following the `^' is treated as if it
were first (`-' and `]' are not special there).
Note that a complement character set can match a newline, unless
newline is mentioned as one of the characters not to match.
`^'
is a special character that matches the empty string, but only if
at the beginning of a line in the text being matched. Otherwise
it fails to match anything. Thus, `^foo' matches a `foo' which
occurs at the beginning of a line.
`$'
is similar to `^' but matches only at the end of a line. Thus,
`xx*$' matches a string of one `x' or more at the end of a line.
`\'
has two functions: it quotes the special characters (including
`\'), and it introduces additional special constructs.
Because `\' quotes special characters, `\$' is a regular
expression which matches only `$', and `\[' is a regular
expression which matches only `[', and so on.
Note: for historical compatibility, special characters are treated
as ordinary ones if they are in contexts where their special meanings
make no sense. For example, `*foo' treats `*' as ordinary since there
is no preceding expression on which the `*' can act. It is poor
practice to depend on this behavior; better to quote the special
character anyway, regardless of where it appears.
For the most part, `\' followed by any character matches only that
character. However, there are several exceptions: characters which,
when preceded by `\', are special constructs. Such characters are
always ordinary when encountered on their own. Here is a table of `\'
constructs.
`\|'
specifies an alternative. Two regular expressions A and B with
`\|' in between form an expression that matches anything that
either A or B will match.
Thus, `foo\|bar' matches either `foo' or `bar' but no other
string.
`\|' applies to the largest possible surrounding expressions.
Only a surrounding `\( ... \)' grouping can limit the grouping
power of `\|'.
Full backtracking capability exists to handle multiple uses of
`\|'.
`\( ... \)'
is a grouping construct that serves three purposes:
1. To enclose a set of `\|' alternatives for other operations.
Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
2. To enclose a complicated expression for the postfix `*' to
operate on. Thus, `ba\(na\)*' matches `bananana', etc.,
with any (zero or more) number of `na' strings.
3. To mark a matched substring for future reference.
This last application is not a consequence of the idea of a
parenthetical grouping; it is a separate feature which happens to
be assigned as a second meaning to the same `\( ... \)' construct
because there is no conflict in practice between the two meanings.
Here is an explanation of this feature:
`\DIGIT'
after the end of a `\( ... \)' construct, the matcher remembers
the beginning and end of the text matched by that construct.
Then, later on in the regular expression, you can use `\'
followed by DIGIT to mean "match the same text matched the
DIGIT'th time by the `\( ... \)' construct."
The strings matching the first nine `\( ... \)' constructs
appearing in a regular expression are assigned numbers 1 through
9 in order that the open-parentheses appear in the regular
expression. `\1' through `\9' may be used to refer to the text
matched by the corresponding `\( ... \)' construct.
For example, `\(.*\)\1' matches any newline-free string that is
composed of two identical halves. The `\(.*\)' matches the first
half, which may be anything, but the `\1' that follows must match
the same exact text.
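The doubled-halves example can be tried in Python syntax, where plain
parentheses play the role of `\(' and `\)':

```python
import re

# `\(.*\)\1' in Python syntax: a group followed by a backreference
# matches any (newline-free) string made of two identical halves.

assert re.fullmatch(r"(.*)\1", "abcabc") is not None  # two identical halves
assert re.fullmatch(r"(.*)\1", "abcabd") is None      # halves differ
print("backreference examples hold")
```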
`\`'
matches the empty string, provided it is at the beginning of the
buffer.
`\''
matches the empty string, provided it is at the end of the buffer.
`\b'
matches the empty string, provided it is at the beginning or end
of a word. Thus, `\bfoo\b' matches any occurrence of `foo' as a
separate word. `\bballs?\b' matches `ball' or `balls' as a
separate word.
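The word-boundary examples above work identically in Python:

```python
import re

# `\b' matches the empty string at the edge of a word, so
# `\bballs?\b' finds `ball' or `balls' only as a separate word.

assert re.search(r"\bballs?\b", "two balls here") is not None
assert re.search(r"\bfoo\b", "in foobar") is None  # not a separate word
print("word-boundary examples hold")
```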
`\B'
matches the empty string, provided it is not at the beginning or
end of a word.
`\<'
matches the empty string, provided it is at the beginning of a
word.
`\>'
matches the empty string, provided it is at the end of a word.
`\w'
matches any word-constituent character. The editor syntax table
determines which characters these are.
`\W'
matches any character that is not a word-constituent.
Here is a complicated regexp, used by Emacs to recognize the end of
a sentence together with any whitespace that follows. It is given in
Lisp syntax to enable you to distinguish the spaces from the tab
characters. In Lisp syntax, the string constant begins and ends with
a double-quote. `\"' stands for a double-quote as part of the regexp,
`\\' for a backslash as part of the regexp, `\t' for a tab and `\n'
for a newline.
"[.?!][]\"')]*\\($\\|\t\\| \\)[ \t\n]*"
This contains four parts in succession: a character set matching
period, `?' or `!'; a character set matching close-brackets, quotes or
parentheses, repeated any number of times; an alternative in
backslash-parentheses that matches end-of-line, a tab or two spaces;
and a character set matching whitespace characters, repeated any
number of times.
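Transcribed into Python regexp syntax (plain parentheses and `|', with
`re.MULTILINE' so that `$' matches at each end of line), the same
expression can be exercised directly. The transcription is my own and
is not part of gptx:

```python
import re

# The Emacs sentence-end regexp above, in Python syntax.  The third
# alternative inside the group is two consecutive spaces.
SENTENCE_END = re.compile(r"[.?!][]\"')]*($|\t|  )[ \t\n]*", re.MULTILINE)

text = "First sentence.  Second one?  Third."
print([m.start() for m in SENTENCE_END.finditer(text)])
```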
File: gptx.info, Node: ptx mode, Next: Future, Prev: Regexps, Up: Top
`ptx' compatibility mode
------------------------
This section outlines the differences between this program and
standard `ptx'. For someone used to standard `ptx', here are some
points worth noticing when not using `ptx' compatibility mode:
* In normal mode, concordance output is not formatted for `troff' or
`nroff'. By default, output is rather formatted for a dumb
terminal. `troff' or `nroff' output may still be selected
through option `-O'.
* In normal mode, unless `-R' option is used, the maximum reference
width is subtracted from the total output line width. In `ptx'
compatibility mode, the width of references is not taken into
account in the output line width computations.
* In normal mode, all 256 characters, even `NUL's, are read and
processed from input file with no adverse effect. No attempt is
made to limit this in `ptx' compatibility mode. However, standard
`ptx' does not accept 8-bit characters, a few control characters
are rejected, and the tilde `~' is condemned.
* In normal mode, input line length is limited by available memory.
No attempt is made to limit this in `ptx' compatibility mode.
However, standard `ptx' processes only the first 200 characters in
each line.
* In normal mode, the break (non-word) characters default to be
every character except letters. In `ptx' compatibility mode, the
break characters default to space, tab and newline only.
* In some circumstances, output lines are filled a little more
completely in normal mode than in `ptx' compatibility mode. Even
in `ptx' mode, there are some slight disposition glitches this
program does not completely reproduce, even if it comes quite
close.
* The Ignore file default in `ptx' compatibility mode is not the
same as in normal mode. In default installation, default Ignore
files are `/usr/lib/eign' in `ptx' compatibility mode, and
nothing in normal mode.
* Standard `ptx' disallows specifying both the Ignore file and the
Only file at the same time. This version allows both, and
specifying an Only file does not inhibit processing the Ignore
file.
File: gptx.info, Node: Future, Prev: ptx mode, Up: Top
Development guidelines
----------------------
This software is meant to evolve towards a concordance package for
GNU, which should ideally be able to tackle true, real, big
concordance jobs, while staying fast and easy for small jobs.
Several packages of this kind are awfully slow, so I am trying to
keep speed in mind. I am interested in interactive querying, but I
postpone burdening myself with it too much too soon.
Here is a *What To Do Next* list, in expected execution order.
1. Increase short term usability:
* Support the program for the GNU community. As directed by
user comments, test and debug the whole thing more fully,
and on bigger examples. Solve portability glitches, as long
as this does not induce anything too ugly in the code.
* Provide sample macros in the documentation.
* Understand and mimic `-t' option, if I can.
* See how TeX mode could be made more useful, and if a texinfo
mode would mean something to someone.
* Sort keywords intelligently for Latin-1 code. See how to
interface this character set with various output formats.
Also, introduce options to inverse-sort and possibly to
reverse-sort.
* Improve speed for Ignore and Only tables. Consider hashing
instead of sorting. Consider playing with obstacks to
digest them.
* Provide better handling of format effectors obtained from
input, and also attempt white space compression on output
which would still maximize full output width usage.
2. Provide multiple language support.
Most of the boosting work should go along the line of fast
recognition of multiple and complex boundaries, which define
various `languages'. Each such language has its own rules for
words, sentences, paragraphs, and reporting requests. This is
less difficult than I first thought:
* Recognize language modifiers with each option. At least `-b',
`-i', `-o', `-W' and `-S', and also new language switcher options, will
have such modifiers. Modifiers on language switchers will
allow or disallow language transitions.
* Complete the transformation of underlying variables into
arrays in the code.
* Implement a heap of positions in the input file. There is
one entry in the heap for each compiled regexp; it is
initialized by a re_search after each regexp compile.
Regexps reschedule themselves in the heap when their
position is passed while scanning the input. In this way, looking
simultaneously for a lot of regexps should not be too
inefficient, once the scanning starts. If this works ok,
maybe consider accepting regexps in Only and Ignore tables.
* Merge boundary processing options into language processing,
really integrating `-S' processing as a special case. Maybe
implement several levels of boundaries. See how to implement
a stack of languages, for handling quotations. See if more
sophisticated references could be handled as another special
case of a language.
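The heap-of-positions idea in the list above can be sketched with
Python's `heapq'. This is a rough sketch of the scheduling scheme
only, not anything gptx contains; names are invented:

```python
import heapq
import re

# Sketch of the heap-of-positions idea: one heap entry per compiled
# regexp, keyed by its next match position; pop the earliest match,
# report it, re-search past it, and push the regexp back.

def scan(text, patterns):
    heap = []
    for idx, pat in enumerate(map(re.compile, patterns)):
        m = pat.search(text)
        if m:
            heapq.heappush(heap, (m.start(), idx, pat, m))
    hits = []
    while heap:
        pos, idx, pat, m = heapq.heappop(heap)
        hits.append((pos, patterns[idx]))
        m2 = pat.search(text, m.start() + 1)  # reschedule this regexp
        if m2:
            heapq.heappush(heap, (m2.start(), idx, pat, m2))
    return hits

print(scan("foo bar foo", [r"foo", r"bar"]))
```

All matches come out in position order with a single left-to-right
pass per regexp, which is the point of the scheme.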
3. Tackle other aspects, in a more long term view:
* Add options for statistics, frequency lists, referencing,
and all other prescreening tools and subsidiary tasks of
concordance production.
* Develop an interactive mode. Even better, construct a GNU
Emacs interface. I am looking at Gene Myers's
<gene@cs.arizona.edu> suffix arrays as a possible
implementation of those ideas.
* Implement hooks so that word classification and tagging can be
merged in. See how to effectively hook in lemmatisation or
other morphological features. It is far from clear for now
how to interface this correctly, so some experimentation
is mandatory.
* Profile and speed up the whole thing.
* Make it work on small address space machines. Consider
three levels of hugeness for files, and three corresponding
algorithms to make optimal use of memory. The first case is
when all the input files and all the word references fit in
memory: this is the case currently implemented. The second
case is when the files cannot fit all together in memory, but
the word references do. The third case is when even the
word references cannot fit in memory.
* There are also subsidiary developments for in-core
incremental sort routines as well as for external sort
packages. The need for more flexible sort packages comes
partly from the fact that linguists use kinds of keys which
compare in unusual and more sophisticated ways. GNU `sort'
has been released recently, and could evolve with `gptx'.